1  Introduction

Due  to  advances  in  integration  technology  the  use  of  asynchronous  circuits  has 
become  increasingly  interesting.  Design  methods  have  emerged  with  which  it  is 
manageable  to  design  efficient  and  reliable  asynchronous  circuits. 

Instead  of  designing  circuits  under  worst  case  assumptions  as  for  synchronous 
circuits,  the  objective  in  asynchronous  design  is  to  attain  the  best  possible 
average  performance  and  to  utilize  this  potential  performance  advantage  at  the 
architectural  level. 

We have designed a serial-parallel multiply-accumulate unit that exploits this
performance advantage. The unit is designed to be part of a large ring network
of units performing vector-matrix multiplications. As the system contains a
large number of these multiply-accumulate units, we choose the area-economic
serial-parallel approach. Furthermore, we want the design to take advantage of
the fact that a large percentage of the elements in the matrix are small integers,
with zero as a special case. The result is a flexible multiply-accumulator whose
computation time is proportional to the bit length of the serial input multiplier.

The design has been implemented as a delay-insensitive circuit, i.e. the
functional correctness is independent of any delays in circuit elements as well
as wires, except for certain wire forks, called isochronic forks, for which we
assume that the difference in delay between the branches of the fork is
negligible [4]. Circuits of this kind constitute a sub-class of the class of
asynchronous circuits.

*The research described in this report was sponsored by the Defense Advanced Research
Projects Agency, DARPA Order number 6202; and monitored by the Office of Naval Research
under contract number N00014-87-K-0745.

† On leave from Department of Computer Science, Technical University of Denmark.

The paper describes the design and implementation of the multiply-accumulate
unit using the method and tools developed at Caltech [4] for the design
of delay-insensitive circuits. With this method, which consists of a
sequence of transformations leading to a circuit description, the designer goes
through the following steps:

Algorithm
    ↓
CSP-specification
    ↓
Handshake expansion
    ↓
Production rules
    ↓
Layout

The  transformations,  which  are  performed  at  each  level,  are  supported  by 
interactive  design  and  analysis  tools  from  the  handshake  expansion  level  down. 

The  purpose  of  this  paper  is  twofold:  To  present  a  delay-insensitive  serial- 
parallel  multiplier  and  to  illustrate  the  full  course  of  design  from  a  high-level 
description  to  fabrication  on  a  non-trivial  example. 

The description of the method should be seen as an attempt to give the
reader insight into the design of a full-scale delay-insensitive circuit from top
to bottom; it is not a complete presentation of the design method (for this we
refer to [4]).

We have fabricated an eight-bit prototype of the multiply-accumulate unit
in 2 µm CMOS. Because of its delay-insensitivity, the chip is very robust towards
variations in operating conditions. At room temperature and 5 volts the chip
has a cycle time of 27 nsec for a one-bit serial-parallel multiplication. The design
is scalable to wider word sizes without loss of performance.


2  Algorithm → CSP-specification

The algorithm is inspired by previous work on a digital artificial neural network
engine [6]. The architecture of this network is based on a systolic ring network
proposed by Kung and Hwang [2]. In this architecture the neural computations
are performed as consecutive vector-matrix multiplications, where the vector
represents the state of the neural network and each element in the matrix
represents the weight of a connection between two neuron processors. A zero
weight represents “no connection”.

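So, the core of the neural computation is to perform a vector-matrix
multiplication:

\[
P = A \times B
\quad\text{or}\quad
\begin{pmatrix} P_1 \\ \vdots \\ P_N \end{pmatrix}
=
\begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1N} \\
\vdots & \vdots &        & \vdots \\
A_{N1} & A_{N2} & \cdots & A_{NN}
\end{pmatrix}
\begin{pmatrix} B_1 \\ \vdots \\ B_N \end{pmatrix}
\tag{1}
\]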

This multiplication is performed by arranging a set of multiply-accumulate
processors in a ring network. To each processor is attached a local memory
containing one row of the matrix. The vector elements are distributed in the ring
(one for each processor) and circulated among the processors during operation.
The task for each processor is to calculate one inner product of the result
vector, i.e. for processor i:

\[
P_i = \sum_{j=1}^{N} A_{ij} B_j
\tag{2}
\]

We have chosen the serial-parallel approach for the implementation of this
computation for the following reason: Assuming bit lengths m and n of the
multiplicand and the multiplier respectively, the size of the iterative multiplier
is O(m + n), in contrast to O(mn) for a combinational counterpart. This area
consideration becomes of interest already for bit lengths of ten to twenty. The
final system of neuron processors will consist of many hundreds of identical
processors in which the multiplier is a principal component. The size of the
multiplier therefore greatly influences the size of the full system.

The summation in (2) is expanded to a serial-parallel implementation:

\[
P_i = \sum_{j=1}^{N} \sum_{k=0}^{n-1} 2^k a_{ijk} B_j
\]

where $A_{ij} = \sum_{k=0}^{n-1} 2^k a_{ijk}$.

From this formula we can formulate a CSP-specification for processor i,
1 ≤ i ≤ N:

PROC[i] ≡ ( sum := 0; j := 1;
            *[ j ≤ N → k := 0;
                       *[ k < n → sum := sum + 2^k · a_ijk · B_j;
                                  k := k + 1
                        ];
                       j := j + 1
             ];
            P_i := sum
          )
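As a cross-check of the specification above, the double loop can be transcribed almost literally into Python. This is only a sketch: the function name `proc_inner_product`, the list representation of a matrix row and the restriction to non-negative integers are assumptions made here for illustration and are not part of the design.

```python
def proc_inner_product(row, vec, n):
    """Bit-serial inner product following the loop structure of PROC[i].

    row -- multipliers A_i1 .. A_iN (assumed non-negative and below 2**n)
    vec -- multiplicands B_1 .. B_N
    n   -- bit length of the multipliers
    """
    total = 0
    for a_ij, b_j in zip(row, vec):          # outer loop over j
        for k in range(n):                   # inner loop over k
            a_ijk = (a_ij >> k) & 1          # bit k of the multiplier
            total += (1 << k) * a_ijk * b_j  # sum := sum + 2^k * a_ijk * B_j
    return total
```

For the row (3, 0, 5) and the vector (7, 9, 2) the function returns 31, the same value as the ordinary inner product.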
Next, we decompose the processor specification into a process controlling the
loop indices, j and k (called “ENVIRONMENT”), and a process performing the
computation (called “MAP”):

ENVIRONMENT ≡ ( P?P_i; j := 1;
                *[ j ≤ N → B!B_j;
                           k := 0;
                           *[ k < n → A!a_ijk;
                                      k := k + 1
                            ];
                           j := j + 1
                 ];
                P?P_i
              )

MAP ≡ *[[ B̅ → B?bb
        [] A̅ → A?a'; sum := sum + a' · bb; bb := 2 · bb
        [] P̅ → P!sum; sum := 0
        ]]

All internal variables in MAP except a' are integers; a' represents a single bit
of the multiplier A. The initialization of sum to zero is done by an initial
communication on channel P. Note that the multiplication with 2^k in
MAP is performed by multiplying bb by 2 after each accumulation.

As the two variables, sum and bb, occur in the expressions that are assigned
to themselves, it is necessary to introduce extra variables to hold the values
during the assignment:

MAP ≡ *[[ B̅ → B?bb
        [] A̅ → A?a', b := bb, acc := sum; sum := acc + a' · b, bb := 2 · b
        [] P̅ → P!sum; sum := 0
        ]]

This specification of MAP is now in a form that can be implemented, but
first we discuss a possible implementation of the environment.

The  environment 

Even though our focus is on the implementation of the multiply-accumulate
unit, it is necessary to consider how the process constituting the immediate
environment for the unit may be implemented. For simplicity we want to avoid
using counters to keep track of the loop indices. Furthermore, the control of the
loop indices is local to each processor, whereas the values of the multiplicand,
B_j, and the inner product, P_i, are sent from and to the main ring of processors.
This suggests that the control of the accumulator be stored together with the
multiplier bits in the local memory.

With this scheme we are able to utilize variable bit length multipliers. The
multiplication may be interrupted and the next one started when the remaining
most significant bits in the multiplier are all zero. In this way the actual
computation time for a multiplication is between zero (for multiplication with
zero) and n cycles. For an even distribution of numbers the average time is
n − 1 cycles. For many applications in artificial neural networks the
distribution is not even, but biased towards small numbers. In particular, for a
large class of neural net configurations more than half of the numbers in the
matrix will be zero. It is shown in [7] that an asynchronous implementation is
well suited to take advantage of these properties.
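The figure of n − 1 cycles for an even distribution can be checked by enumeration. The sketch below assumes that "even distribution" means all n-bit values 0 .. 2^n − 1 are equally likely and that a multiplier costs one cycle per significant bit; the function name is chosen here for illustration.

```python
def average_bit_length(n):
    """Mean number of significant bits over all n-bit values 0 .. 2**n - 1."""
    return sum(v.bit_length() for v in range(2 ** n)) / 2 ** n
```

For n = 8 the mean is 1793/256, i.e. just above 7 = n − 1; in general the exact value under these assumptions is n − 1 + 2^(−n).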

A possible description of the environment, where the control of the
accumulator and the communication of the multiplicands and inner products are
separated, is:

*[ MEM?w; [ w ≤ 1 → A!w [] w = 2 → NB [] w = 3 → R ] ]
||
*[ RINGI?s; [ B̅ → B!s [] Q̅ → Q?s ]; RINGO!s ]
||
*[ P?x; Q!f(x) ]

From here, we concentrate on the implementation of MAP. We change the
specification of MAP to reflect that the control of and communication on the
B and P channels are separated:

MAP ≡ *[[ N̅B̅ → NB • B?bb
        [] A̅ → A?a', b := bb, acc := sum; sum := acc + a' · b, bb := 2 · b
        [] R̅ → R • P!sum; sum := 0
        ]]

The environment and accumulator with communication channels are sketched in
Figure 1. Note that the construction of the first environment process guarantees
that the three guards in MAP are mutually exclusive.
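To make the role of the memory words concrete, the following sketch encodes one matrix row as a word stream and interprets it with the three guarded commands of MAP. It assumes the encoding read from the first environment process (words 0 and 1 are multiplier bits, 2 selects NB, 3 selects R), least-significant-bit-first order, and non-negative weights; the names `encode_row` and `run_map` are hypothetical.

```python
def encode_row(weights):
    """Encode one matrix row as a stream of memory words.

    Words 0 and 1 are multiplier bits (least significant first),
    2 requests a new multiplicand (NB) and 3 requests the result (R).
    """
    words = []
    for a in weights:
        words.append(2)                  # NB: fetch the next B_j
        while a > 0:                     # stop once the remaining bits are zero
            words.append(a & 1)
            a >>= 1
    words.append(3)                      # R: output the inner product
    return words


def run_map(words, vec):
    """Interpret a word stream with the three guarded commands of MAP."""
    multiplicands = iter(vec)
    total, bb, results = 0, 0, []
    for w in words:
        if w == 2:                       # NB-guard: bb := B_j
            bb = next(multiplicands)
        elif w == 3:                     # R-guard: P!sum; sum := 0
            results.append(total)
            total = 0
        else:                            # A-guard: accumulate, then double bb
            total += w * bb
            bb *= 2
    return results
```

With the row (3, 0, 5) the stream is 2 1 1 2 2 1 0 1 3: the zero weight costs no A-cycle at all, and the stream still yields the inner product 31 for the vector (7, 9, 2).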

Carry-save  versus  ripple-carry  adder 

Before we decompose our specification, we should decide whether the multiplier
should be implemented with a carry-save or a ripple-carry adder. The ripple-carry
adder has become a popular example illustrating the benefits of asynchronous
circuits. This is due to the fact that with very simple hardware we obtain very
good average performance, logarithmic in the number of bits added [5]. This
is a very important property when the result is needed immediately.
For this application, we do not need the result of each addition in binary
form,  but  only  the  final  inner  product.  By  using  a  carry-save  adder  we  can  do 
an  addition  in  unit  time.  The  carry  part  of  the  accumulated  sum  needs  only  to 
be  resolved  once  per  vector-matrix  multiplication. 
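The carry-save idea can be sketched at word level, with Python integers standing in for the sum and carry bit vectors. This is an illustration under stated assumptions rather than the circuit itself: the function names are hypothetical, the operands are non-negative, and the accumulated total is assumed to fit in `width` bits.

```python
def carry_save_add(s, c, x):
    """One carry-save step: returns (sum, carry) with sum + carry == s + c + x."""
    new_s = s ^ c ^ x
    new_c = ((s & c) | (s & x) | (c & x)) << 1
    return new_s, new_c


def accumulate(values, width):
    """Accumulate values in carry-save form, then flush the carries."""
    s = c = 0
    for x in values:
        s, c = carry_save_add(s, c, x)   # constant-time step, no carry propagation
    for _ in range(width - 1):           # flush: add zero until the carry part is empty
        s, c = carry_save_add(s, c, 0)
    return s, c
```

Each step leaves `s + c` equal to the running total, so the carries only have to be resolved once, after the last addition.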

Computation  time 

The cycle times associated with input of a new multiplicand, B_j, multiplication
with one multiplier bit, and output of the result P are denoted t_b, t_a, and t_p,
respectively. The multiplication time for an n-bit multiplier is
t_{mult,n} = t_b + n·t_a. Multiplication with zero takes time t_{mult,0} = t_b.
The average multiplication time is t_{mult} = t_b + a_n·t_a, where a_n denotes
the average multiplier bit length for the application. The average time used for
the whole task of calculating the inner product is
t_{ip} = N(t_b + a_n·t_a) + (s − 1)·t_a + t_p. Here, s is the bit length of the
inner product; the term (s − 1)·t_a is the time needed to flush the carry part
into the sum part of the inner product. This is done by repeated multiplication
with zero bits.
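The timing model can be evaluated directly. In the sketch below the default cycle times are the values measured on the prototype at room temperature and 5 volts (t_b = 37, t_a = 27, t_p = 30 nsec); the function name and the example parameters are assumptions for illustration only.

```python
def inner_product_time(N, a_n, s, t_b=37.0, t_a=27.0, t_p=30.0):
    """Average time in nsec for one inner product.

    Implements t_ip = N (t_b + a_n t_a) + (s - 1) t_a + t_p, where
    N is the vector length, a_n the average multiplier bit length
    and s the bit length of the inner product.
    """
    return N * (t_b + a_n * t_a) + (s - 1) * t_a + t_p
```

For a hypothetical case with N = 100, an average multiplier length of 2 bits and a 16-bit inner product this gives 9535 nsec, of which only 405 nsec is spent flushing the carries.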

By considering the frequency with which each of the three guarded commands
will be activated during operation it is possible to optimize the implementation.
The NB guard will be activated once per multiplication. During each
multiplication the A guard will be activated once for each bit in the multiplier.
This corresponds to the inner loop of the ENVIRONMENT process in section 2.
Finally, the R guard will only be activated once after each full vector-matrix
multiplication.

Even if most multipliers are small numbers, the A guard is the most frequently
activated guard. Hence, the performance of this guarded command should be
optimized as much as possible, if necessary at the expense of the others.

Decomposition 

The CSP-specification, MAP, is decomposed into processes handling single bit
variables only.

The implementation of MAP contains at least m + n − 1 processes, MA[l],
0 ≤ l, each containing a full adder:

MA[l] ≡ *[[ N̅B̅ → NB • B?bb
          [] A̅ → A?a, b := bb, CI?c, CO!carry, acc := sum;
                  sum := SUM(a, b, c, acc), carry := CARRY(a, b, c, acc),
                  BI?bb, BO!b
          [] R̅ → R • P!sum; sum := 0
          ]]


All variables in the specification are booleans. The value of B_j (with zeros
concatenated as most significant bits) is distributed with one bit to each process,
i.e. process MA[l] receives b_jl. The processes further produce each one bit of
the accumulated sum, P_i. The value of a_ijk is sent to all processes in parallel.
The BO and CO ports in each process are connected to the BI and CI ports of
the next more significant bit process.

(To accommodate accumulated sums larger than the minimum limit, the
multiply-accumulate processor is extended with an appropriate number of
processes. For the sake of regularity, these may be chosen identical to the first
m + n − 1, but they can be simplified to contain half-adders, as they are only
accumulating overflowing carries from the accumulation of products. The number
of necessary additional processes depends on the application.)

For reasons of efficiency we want to move the communication of the B-bits to
happen in parallel with the other communications in the A-guarded command.
This will improve the performance, as the unit will perform one communication
step followed by an internal step, instead of two communication steps. To make
this possible, it is necessary to send the value of B_j shifted one bit to the left,
i.e. process MA[l] receives b_{j,l+1}; see Figure 2. Process MA[l] becomes:

MA[l] ≡ *[[ N̅B̅ → NB • B?bb
          [] A̅ → A?a, BI?b, BO!bb, CI?c, CO!carry, acc := sum;
                  sum := SUM(a, b, c, acc), carry := CARRY(a, b, c, acc),
                  bb := b
          [] R̅ → R • P!sum; sum := 0
          ]]

The two ends of the string of processes need to be closed appropriately. In
the accumulate cycle (A becomes true) process MA[l] starts out by reading in
a b and a c from MA[l − 1] and sending a bb and a carry to MA[l + 1]. Process
MA[0] communicates with process MA[−1]:

MA[−1] ≡ *[[ N̅B̅ → NB • B?bb
           [] A̅ → A, BO!bb, CO!0; bb := 0
           [] R̅ → R
           ]]

The process handling the most significant bits, MA[M], M > m + n − 1, is:

MA[M] ≡ *[[ N̅B̅ → NB
          [] A̅ → A, BI?, CI?c, acc := sum;
                  sum := EXOR(c, acc)
          [] R̅ → R • P!sum; sum := 0
          ]]

In the following, we will concentrate on the intermediate processes only.
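The behaviour of the whole string of bit processes can be explored with a small simulation. The sketch below is an assumption-laden model, not the circuit: it treats every A-cycle as one synchronous step, takes SUM as acc ⊕ c ⊕ (a ∧ b) and CARRY as the majority of acc, c and a ∧ b, uses a single top process, and flushes the carries with one zero bit per full-adder process. The function name and argument names are hypothetical.

```python
def mac_bit_slices(row, vec, n, cells):
    """Simulate `cells` full-adder processes MA[0..cells-1] plus the two end processes."""
    total = [0] * (cells + 1)        # sum bits; index `cells` is the top process MA[M]
    carry = [0] * cells              # carry bit held in each full-adder process
    bb = [0] * cells                 # multiplicand bit held in each process
    bb_low = 0                       # multiplicand bit held in MA[-1]

    def a_step(a):
        nonlocal bb_low
        b_in = [bb_low] + bb[:-1]    # BI?b: bit from the less significant neighbour
        c_in = [0] + carry[:-1]      # CI?c: MA[-1] always sends carry 0
        total[cells] ^= carry[-1]    # MA[M]: sum := EXOR(c, acc)
        for l in range(cells):
            acc, p = total[l], a & b_in[l]
            total[l] = acc ^ c_in[l] ^ p                            # SUM
            carry[l] = (acc & c_in[l]) | (acc & p) | (c_in[l] & p)  # CARRY
            bb[l] = b_in[l]          # bb := b
        bb_low = 0

    for a_ij, b_j in zip(row, vec):
        bb_low = b_j & 1             # NB: MA[-1] holds bit 0 of B_j
        for l in range(cells):
            bb[l] = (b_j >> (l + 1)) & 1   # MA[l] holds bit l + 1
        for k in range(n):
            a_step((a_ij >> k) & 1)  # one A-cycle per multiplier bit
    for _ in range(cells):           # flush the carries with zero bits
        a_step(0)
    return sum(bit << l for l, bit in enumerate(total))   # R: read out the sum bits
```

With enough processes to hold the result, the model reproduces the ordinary inner product, for example 31 for the row (3, 0, 5) and the vector (7, 9, 2) with twelve processes.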


3  CSP → Handshake expansion

The implementation of the given specification as production rules can follow
one of two strategies:

•  The full handshake expansion is derived directly from the specification,
including communication actions as well as assignments. Unfortunately,
the handshake expansion may become very extensive if the combinational
expressions are complicated.

•  The assignments and message communications are decomposed into separate
processes, the data path, which are treated separately. The handshake
expansion is then derived from a CSP description including communication
actions only: the control part. This makes it possible to optimize
each of the parts separately [3].

Because  of  the  complexity  of  the  boolean  expressions  for  the  SUM  and 
CARRY  functions,  the  second  strategy  is  used. 

In the following we derive a handshake expansion for MA[l] through the
following steps:

1.  Separation  of  the  data  path  from  the  control  part. 

2.  Replacement  of  each  communication  action  with  its  implementation  as 
elementary  actions  on  the  two  boolean  signals  that  constitute  the  channel. 

3.  Reshuffling  of  actions  to  optimize  performance. 

The  final  transformations  of  the  handshake  expansions  before  implementation 
as  production  rules  will  be  treated  in  the  following  section. 

Decomposition  of  data  path 

From  the  CSP  specification,  we  derive  a  specification  of  the  control  part  by  the 
following  steps: 

•  All  message  passing  is  removed  from  the  communication  actions. 

•  All assignments are removed and put into separate processes, leaving each
assignment as a simple communication action in the specification, i.e.

*[...; x := y; ...] becomes *[...; D; ...] || *[[D̅ → x := y; D]].

•  All  communication  ports  are  assigned  to  be  active  or  passive.  We  have 
used  the  approach  that  probed  ports  and  output  ports  are  passive  and 
ports  corresponding  to  assignments  and  input  ports  (which  are  not  probed) 
are  (lazy)  active.  The  choices  are  indicated  with  indices  “P”  or  “A”. 


After these steps we get (the active ends of the channels are indicated with
dots):

MA[l] ≡ *[[ N̅B̅ → NB_P • B_A
          [] A̅ → A_P, BI_A, BO_P, CI_A, CO_P, T_A; S_A, C_A, BB_A
          [] R̅ → R_P • P_P; Z_A
          ]]

The communication actions T, S, C, BB, and Z call the data path processes:

*[[ T̅ → acc := sum; T_P ]] ||
*[[ S̅ → sum := SUM(a, b, c, acc); S_P ]] ||
*[[ C̅ → carry := CARRY(a, b, c, acc); C_P ]] ||
*[[ B̅B̅ → bb := b; BB_P ]] ||
*[[ Z̅ → sum := 0; Z_P ]]


Implementation  of  communication  actions 

Each communication action is replaced with its implementation as elementary
actions on the two boolean signals that constitute the channel, e.g. x_i and x_o
for communication X. We choose from the three four-phase implementations:

Active:       x_o↑; [x_i]; x_o↓; [¬x_i]
Lazy active:  [¬x_i]; x_o↑; [x_i]; x_o↓
Passive:      [x_i]; x_o↑; [¬x_i]; x_o↓
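The three sequences can be exercised against each other in a toy model. The sketch below assumes a channel made of two wires, here called `req` (driven by the initiating side) and `ack` (driven by the responding side), both initially low; the step encoding and the function name are illustrative choices, not part of the method.

```python
ACTIVE      = [("set", 1), ("wait", 1), ("set", 0), ("wait", 0)]
LAZY_ACTIVE = [("wait", 0), ("set", 1), ("wait", 1), ("set", 0)]
PASSIVE     = [("wait", 1), ("set", 1), ("wait", 0), ("set", 0)]


def run_handshake(initiator, responder):
    """Interleave two four-phase sequences and return the wire transitions."""
    wires = {"req": 0, "ack": 0}                 # both wires start low
    procs = [(initiator, "req", "ack"), (responder, "ack", "req")]
    pos = [0, 0]
    trace = []
    progress = True
    while progress:
        progress = False
        for p, (steps, out, inp) in enumerate(procs):
            if pos[p] == len(steps):
                continue                          # this side has finished
            kind, value = steps[pos[p]]
            if kind == "set":
                wires[out] = value                # transition on the output wire
                trace.append(out + ("+" if value else "-"))
            elif wires[inp] != value:
                continue                          # wait not yet satisfied
            pos[p] += 1
            progress = True
    if pos != [len(initiator), len(responder)]:
        raise RuntimeError("deadlock")
    return trace
```

Both an active and a lazy-active initiator complete the same four transitions against a passive partner, whereas two passive ends wait for each other forever, which the model reports as a deadlock.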

This step yields the full handshake expansion:

MA[l] ≡
*[[ nb_i → [¬b_i]; b_o↑, nb_o↑; [¬nb_i ∧ b_i]; b_o↓, nb_o↓
 [] a_i  → (a_o↑; [¬a_i]; a_o↓), ([¬bi_i]; bi_o↑; [bi_i]; bi_o↓),
           ([¬ci_i]; ci_o↑; [ci_i]; ci_o↓), ([¬t_i]; t_o↑; [t_i]; t_o↓),
           ([bo_i]; bo_o↑; [¬bo_i]; bo_o↓), ([co_i]; co_o↑; [¬co_i]; co_o↓);
           ([¬s_i]; s_o↑; [s_i]; s_o↓), ([¬c_i]; c_o↑; [c_i]; c_o↓),
           ([¬bb_i]; bb_o↑; [bb_i]; bb_o↓)
 [] r_i  → [p_i]; p_o↑, r_o↑; [¬r_i ∧ ¬p_i]; p_o↓, r_o↓; [¬z_i]; z_o↑; [z_i]; z_o↓
 ]]


Reshuffling  of  communication  actions 

Reshuffling is the process of reorganizing a sequence of actions in a handshake
expansion without changing the functionality of the sequence. Reshuffling is
primarily applied for performance reasons, but it may also add to the simplicity
of the hardware. The transformation is essential to the sequencing of events in
different processes and as such has the greatest influence on the performance of
the implementation. It is performed by the designer, but approximate performance
figures may easily be extracted by a cycle analysis tool for comparisons [1].

The NB- and R-guarded commands are not reshuffled. The first is as simple
as it can be. For the R-guarded command one might be tempted to postpone
the actions p_o↓ and r_o↓ to happen in parallel with z_o↓, but this would change
the functionality of the specification: the value of sum (from the
CSP-specification) would be reset while it is being sent to the environment.

In  each  step  of  the  A  guarded  command  the  communication  actions  are 
specified  to  happen  in  parallel,  independent  of  each  other.  Through  analysis  of 
the  behavior  of  several  interconnected  processes  the  natural  overlapping  of  the 
communication  actions  may  be  established.  It  is  appreciated  that  the  complexity 
of  the  circuit  will  decrease  and  the  performance  increase  if  it  is  implemented  to 
perform  this  natural  order  of  communication  actions  only.  This  means  that  we 
implement a stronger specification than the parallel operator. Care should be
taken that deadlocks are not introduced as communications to the left and the
right neighbors overlap.

[ a_i → [¬bi_i ∧ ¬ci_i ∧ ¬t_i]; bi_o↑, ci_o↑, t_o↑, a_o↑; [bo_i ∧ co_i]; bo_o↑, co_o↑;
        [bi_i ∧ ci_i ∧ t_i ∧ ¬a_i]; bi_o↓, ci_o↓, t_o↓, a_o↓; [¬bo_i ∧ ¬co_i]; bo_o↓, co_o↓;
        [¬s_i ∧ ¬c_i ∧ ¬bb_i]; s_o↑, c_o↑, bb_o↑; [s_i ∧ c_i ∧ bb_i]; s_o↓, c_o↓, bb_o↓ ]

Further simplification is achieved by collecting signals which change with the
same dependencies in one signal. We collect the signals for BI_A, CI_A and T_A
in I_A; BO_P and CO_P in O_P; and S_A, C_A, and BB_A in X_A. These collections
will be implemented in the section about production rules.

The handshake expansion for the A-guarded command is now:

[ a_i → [¬i_i]; i_o↑, a_o↑; [o_i]; o_o↑; [i_i ∧ ¬a_i]; i_o↓, a_o↓; [¬o_i]; o_o↓;
        [¬x_i]; x_o↑; [x_i]; x_o↓ ]


4  Handshake expansion → production rule set

Before the handshake expansion is ready to be implemented as a production rule
set we need to perform a couple of transformations: state assignment to ensure
correct sequencing, and guard strengthening to ensure non-overlap of guarded
commands. These steps are supported by interactive tools.

The production rule set can be implemented directly after these steps.


State  assignment 

When we implement the handshake expansion as a production rule set we give
up the notion of sequencing using the “;” operator. The sequencing of the
production rules needs to be specified explicitly using variables from the
handshake expansion only. Therefore, it is necessary that all states in the
handshake expansion can be distinguished from each other. If this is not already
the case, specific state variables must be inserted.

The problem arises in both the A- and R-guarded commands. The initial
state cannot be distinguished from either the point after o_o↓ or the point
after p_o↓, r_o↓, so a state variable should be inserted in each of the two guarded
commands. Several heuristics exist for the placement, and the performance
analysis tools give guidance towards an optimal placement. Generally an optimal
place to change a state variable is just before a wait for an input signal
transition, but several iterations may be (and were) necessary, as implementation
issues at lower levels may play a role.

Guard  strengthening 

Finally, it is necessary to ensure that the three guarded commands cannot
overlap. This may be achieved either by strengthening the guards to ensure that
the other commands are finished, or by utilizing that communications on NB,
A, and R are mutually exclusive.

We do not want to strengthen the A-guard, as this would decrease its
performance. Instead we make sure the NB and R channels are “held” as long as
their corresponding operations are performed. We “hold” the NB and R channels
by making the completion of the communications the last action in the
guarded commands. This is already true for the NB communication, but it is
necessary to move the completion of the R communication to happen simultaneously
with the Z action instead of the P communication.

A  similar  arrangement  for  the  A  channel  will  cause  a  performance  reduction 
as  it  will  leave  less  time  for  the  environment  to  fetch  the  next  bit.  Instead  we 
strengthen  the  other  guards  to  wait  for  the  A-guarded  command  to  finish. 

The handshake expansion of MA[l] with the necessary state variables and guard
strengthenings is:

l : [0, m + n − 1] :

MA[l] ≡ *[[ nb_i ∧ ¬u ∧ ¬x_o → [¬b_i]; b_o↑, nb_o↑; [¬nb_i ∧ b_i]; b_o↓, nb_o↓
          [] a_i → [¬i_i]; i_o↑, a_o↑; [o_i]; o_o↑; u↑;
                   [i_i ∧ ¬a_i ∧ u]; i_o↓, a_o↓; [¬o_i]; o_o↓;
                   [¬x_i]; x_o↑; u↓; [x_i ∧ ¬u]; x_o↓
          [] r_i ∧ ¬u ∧ ¬x_o → [p_i]; p_o↑; v↑; [v]; r_o↑; [¬p_i]; p_o↓;
                   [¬z_i]; z_o↑; v↓; [¬r_i ∧ z_i ∧ ¬v]; z_o↓, r_o↓
          ]]




Production  rule  generation 

The generation of production rules from the final handshake expansion is
straightforward. For each signal transition it is examined which variables
uniquely determine a precondition for the transition. The state assignment
described earlier guarantees that this is possible. We get the production rule set
for the control circuitry:

Choice NB:

    nb_i ∧ ¬u ∧ ¬x_o ∧ ¬b_i   →  b_o↑, nb_o↑
    ¬nb_i ∧ b_i               →  b_o↓, nb_o↓

Choice A:

    a_i ∧ ¬i_i ∧ ¬u ∧ ¬x_o    →  i_o↑, a_o↑
    i_o ∧ o_i                 →  o_o↑
    o_o                       →  u↑
    ¬a_i ∧ i_i ∧ u            →  i_o↓, a_o↓
    ¬i_o ∧ ¬o_i               →  o_o↓
    ¬o_o ∧ u ∧ ¬x_i           →  x_o↑
    x_o                       →  u↓
    ¬u ∧ x_i                  →  x_o↓

Choice R:

    r_i ∧ … ∧ ¬u ∧ ¬x_o ∧ p_i →  p_o↑
    p_o                       →  v↑
    v                         →  r_o↑
    r_o ∧ ¬p_i                →  p_o↓
    ¬p_o ∧ v ∧ ¬z_i           →  z_o↑
    z_o                       →  v↓
    ¬r_i ∧ ¬v ∧ z_i           →  z_o↓
    ¬v ∧ …                    →  r_o↓
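One way to gain confidence in a rule set of this kind is to fire the rules against a simple model of the neighbours and compare the resulting order of transitions with the handshake expansion. The sketch below does this for Choice A only. It assumes the rules read as transcribed in `RULES_A`, models the environment with hypothetical rules that complete each handshake as soon as possible, and always fires the first enabled rule; all names are illustrative.

```python
RULES_A = [  # (guard, outputs, new value) for the Choice A control rules
    (lambda s: s["a_i"] and not s["i_i"] and not s["u"] and not s["x_o"], ("i_o", "a_o"), 1),
    (lambda s: s["i_o"] and s["o_i"], ("o_o",), 1),
    (lambda s: s["o_o"], ("u",), 1),
    (lambda s: not s["a_i"] and s["i_i"] and s["u"], ("i_o", "a_o"), 0),
    (lambda s: not s["i_o"] and not s["o_i"], ("o_o",), 0),
    (lambda s: not s["o_o"] and s["u"] and not s["x_i"], ("x_o",), 1),
    (lambda s: s["x_o"], ("u",), 0),
    (lambda s: not s["u"] and s["x_i"], ("x_o",), 0),
]

ENVIRONMENT = [  # hypothetical neighbours completing each handshake
    (lambda s: s["a_o"], ("a_i",), 0),
    (lambda s: s["i_o"], ("i_i",), 1),
    (lambda s: not s["i_o"], ("i_i",), 0),
    (lambda s: s["o_o"], ("o_i",), 0),
    (lambda s: s["x_o"], ("x_i",), 1),
    (lambda s: not s["x_o"], ("x_i",), 0),
]


def run_cycle():
    """Fire rules until the state is stable; return the control output transitions."""
    names = ["a_i", "i_i", "o_i", "x_i", "i_o", "a_o", "o_o", "u", "x_o"]
    state = dict.fromkeys(names, 0)
    state["a_i"] = state["o_i"] = 1      # requests pending on A and O
    trace = []
    fired = True
    while fired:
        fired = False
        for index, (guard, outputs, value) in enumerate(RULES_A + ENVIRONMENT):
            if guard(state) and any(state[o] != value for o in outputs):
                for o in outputs:
                    state[o] = value
                    if index < len(RULES_A):          # record control outputs only
                        trace.append(o + ("+" if value else "-"))
                fired = True
                break                                 # rescan from the first rule
    return trace
```

Under these assumptions one A-cycle produces the transitions in the order prescribed by the handshake expansion with the state variable u, and the control part returns to its initial state.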

5  Production rule set → layout

The  production  rule  set  needs  to  go  through  a  series  of  transformations  before 
it  is  ready  to  be  implemented  in  layout. 

Bubble  reshuffling 

For  a  production  rule  to  be  implemented  in  CMOS  we  need  to  impose  some 
restrictions  on  the  polarity  of  the  signals  in  a  production  rule: 

•  All variables in the guard of an up transition must appear in inverted
form, e.g. ¬x → y↑.

•  All variables in the guard of a down transition must appear in true form,
e.g. x → y↓.

These  restrictions  arise  from  electrical  properties  of  the  p-  and  n-transistors  in 
the  CMOS  technology. 
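The two restrictions are easy to state as a check on a single production rule. The representation below (a list of literals plus a direction) and the function name are assumptions made for this sketch.

```python
def cmos_implementable(rule):
    """Check the polarity restriction for one production rule.

    rule -- (guard_literals, direction): guard_literals is a list of
            (variable, negated) pairs, direction is "up" or "down".
    """
    literals, direction = rule
    if direction == "up":
        return all(negated for _, negated in literals)      # only inverted variables
    return all(not negated for _, negated in literals)      # only true variables
```

A rule of the shape i_o ∧ o_i → o_o↑, for instance, fails the check because an up transition is guarded by variables in true form; such rules are the ones that require a change of polarity.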

It is necessary to go through each set of production rules and change the
polarity of variables in order to make them meet the criteria. This process, called
bubble reshuffling, must take into account that internal signals may contain
isochronic forks, i.e. forks where the difference in the delays of the branches is
assumed to be negligible. It is not allowed to invert only one branch of these
forks. Production rule sets occur which cannot be resolved in this manner because
of a cyclic dependency between the variables. In these cases it is necessary to go
back to the handshake expansion and reshuffle the troublesome events.

The  whole  procedure  of  bubble  reshuffling  is  automated  with  performance 
analysis.  In  the  cases  of  cyclic  dependencies  the  particular  variables  are  pointed 
out. 


Transistor  sizing 

Transistors are sized in order to increase performance. We concentrate primarily
on the sizing of the production rules for the A-guarded command. Further, we
keep the load from the other guarded commands on the shared signals, x_o and
u, small.

Given  bounds  on  smallest,  average  and  largest  transistor  width  the  transistor 
sizing  is  performed  automatically  to  repeatedly  optimize  the  critical  cycle  of 
events  in  the  circuit  [1]. 


Layout 

Layout is automatically produced from the final sized production rule set. The
up and down transitions of each variable are collected in a cell. If the cell is
not combinational, a “staticizer” (a weak feedback loop) is added to the output.
The cells are automatically placed and routed, see Figure 3. The total transistor
count for the control part is 129.

6  Data  Path 

The production rules and transistor network for the combinational expressions
and the communication ports have been designed by hand. We have used the
standard communication ports as they have been derived in [3]. All bit variables
are represented in dual rail. At an input port the dual-rail variable is input into
a register, whose content is stable at the time of use. This requires that the
register acknowledges when the variable has been input (Figure 4):
    xt → x0↓                    xf → x1↓
    ¬x0 → x1↑                   ¬x1 → x0↑
    x0 → x1↓    (feedback to staticize signal)
    x1 → x0↓    (feedback to staticize signal)
    ¬xt ∧ ¬xf → ack↑
    (xt ∧ x1) ∨ (xf ∧ x0) → ack↓

In  the  design  of  the  combinational  expressions  it  is  appreciated  that  this  register 
contains  both  the  true  and  inverted  value  as  stable  signals.  For  the  simple  output 
ports  we  have  the  production  rules,  Figure  5: 

    x1 ∧ go → yt↓               x0 ∧ go → yf↓
    ¬go → yt↑                   ¬go → yf↑
    ¬yf → yt↑    (staticizer)
    ¬yt → yf↑    (staticizer)

In the cases where combinational expressions are involved, these have been
designed by hand and incorporated in the output ports. The pull-down part of
the production rule set for the SUM function is:

    ( (acc1 ∧ c0 ∧ (a0 ∨ b0)) ∨ (acc0 ∧ c1 ∧ (a0 ∨ b0)) ∨
      (acc1 ∧ c1 ∧ (a1 ∧ b1)) ∨ (acc0 ∧ c0 ∧ (a1 ∧ b1)) ) ∧ go → sumt↓

    ( (acc0 ∧ c0 ∧ (a0 ∨ b0)) ∨ (acc1 ∧ c1 ∧ (a0 ∨ b0)) ∨
      (acc0 ∧ c1 ∧ (a1 ∧ b1)) ∨ (acc1 ∧ c0 ∧ (a1 ∧ b1)) ) ∧ go → sumf↓

These  two  expressions  are  manipulated  and  by  transistor  sharing  the  transistor 
count  is  brought  down  to  14  transistors  for  the  two  expressions  (Figure  6). 

The pull-down part of the production rules for CARRY is:

    ((acc1 ∧ c1) ∨ ((c1 ∨ acc1) ∧ (a1 ∧ b1))) ∧ go → carryt↓
    ((acc0 ∧ c0) ∨ ((c0 ∨ acc0) ∧ (a0 ∨ b0))) ∧ go → carryf↓
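The pull-down expressions can be checked exhaustively against the intended arithmetic. The sketch below assumes the reading in which a suffix 1 denotes the true rail and 0 the false rail of a dual-rail variable, that SUM is acc ⊕ c ⊕ (a ∧ b), that CARRY is the majority of acc, c and a ∧ b, and that go is asserted; the function names are hypothetical.

```python
from itertools import product


def pull_downs(a, b, c, acc):
    """Evaluate the four pull-down guards for one input combination (go asserted)."""
    a1, a0, b1, b0 = a, not a, b, not b
    c1, c0, acc1, acc0 = c, not c, acc, not acc
    sum_t = ((acc1 and c0 and (a0 or b0)) or (acc0 and c1 and (a0 or b0)) or
             (acc1 and c1 and (a1 and b1)) or (acc0 and c0 and (a1 and b1)))
    sum_f = ((acc0 and c0 and (a0 or b0)) or (acc1 and c1 and (a0 or b0)) or
             (acc0 and c1 and (a1 and b1)) or (acc1 and c0 and (a1 and b1)))
    carry_t = (acc1 and c1) or ((c1 or acc1) and (a1 and b1))
    carry_f = (acc0 and c0) or ((c0 or acc0) and (a0 or b0))
    return bool(sum_t), bool(sum_f), bool(carry_t), bool(carry_f)


def check_full_adder():
    """Verify all 16 input combinations; exactly one rail of each output is pulled down."""
    for a, b, c, acc in product([False, True], repeat=4):
        p = a and b                                   # partial product bit
        s_t, s_f, c_t, c_f = pull_downs(a, b, c, acc)
        assert s_t == (acc ^ c ^ p) and s_f != s_t
        assert c_t == ((acc and c) or (acc and p) or (c and p)) and c_f != c_t
    return True
```

The check passes for every combination, so under this reading the two rails of each output are mutually exclusive and complete.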

Layout 

The  data  path  consists  of  seven  registers  and  seven  output  ports  (including 
those  for  assignments).  The  218  transistors  are  laid  out  by  hand,  Figure  7. 


7  Merge  of  control  part  and  data  path 

We have implemented a prototype chip consisting of seven full adder cells plus
the two end cells. In this design we decided to decompose the accumulation
into single bit processes (section 2), which means that each bit process consists
of a control part and a data path. The control signals (NB, A and R) are
distributed through a short tree-structured FIFO queue. This structure ensures
that the performance is independent of the length of the accumulator. For
other implementations alternative trade-offs may be considered, for example
one control unit for every four bits or only one for the whole accumulator. The
drawback of the last solution is that the control signals have to be distributed to
the whole array, the computation performed and the acknowledgement signals
collected within the same cycle. This does not scale well.


8  Evaluation 

An eight-bit (4×4 bit) multiply-accumulate unit has been fabricated as a
prototype in 2 µm CMOS (MOSIS TinyChip service). The core of the chip measures
1830 × 1800 µm and contains 3124 transistors. Each multiplier process consists of
347 transistors: 129 in the control part and 218 in the data path.

The chip is very robust towards variations in operating conditions. It has
been tested successfully in the voltage range from below 0.8 volt to above 10 volts
with repeated multiplication performance ranging from below 100 Kbit/sec to
above 58 Mbit/sec. It should be noted that the accumulator is self-adjusting to
these variations in operating conditions; it operates as fast as it can under the
given conditions.

The performance at room temperature and 5 volts is:

Multiplication with multiplier bit:  t_a = 27 nsec, i.e. 37 Mbit/sec.
Input of new multiplicand:           t_b = 37 nsec.
Output of result:                    t_p = 30 nsec.

The implemented design is scalable to wider word sizes without loss of
performance.

The accumulator operates with variable bit length of the multiplier with
a performance for a multiply-accumulate operation in the range of 37 nsec to
n · 27 nsec, where n is the maximum size of the multiplier bit string.
Figure 4:  Transistor diagra:


Figure  7:  Layout  for  the  datapath. 